Sleeper agent




Sleeper Agent: Scalable Hidden Trigger Backdoors for Neural Networks Trained from Scratch

Neural Information Processing Systems

As the curation of data for machine learning becomes increasingly automated, dataset tampering is a mounting threat. Backdoor attackers tamper with training data to embed a vulnerability in models that are trained on that data. This vulnerability is then activated at inference time by placing a "trigger" into the model's input. Typical backdoor attacks insert the trigger directly into the training data, although the presence of such an attack may be visible upon inspection. In contrast, the Hidden Trigger Backdoor Attack achieves poisoning without placing a trigger into the training data at all. However, this hidden trigger attack is ineffective at poisoning neural networks trained from scratch. We develop a new hidden trigger attack, Sleeper Agent, which employs gradient matching, data selection, and target model re-training during the crafting process. Sleeper Agent is the first hidden trigger backdoor attack to be effective against neural networks trained from scratch. We demonstrate its effectiveness on ImageNet and in black-box settings.
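The gradient-matching idea at the core of this attack can be illustrated with a toy sketch. This is not the paper's implementation: it substitutes a linear surrogate model for a deep network and finite-difference descent for autograd, and every dimension, budget, and learning rate below is an illustrative assumption. The attacker perturbs clean training points, within a small l-infinity budget, so that their training gradient aligns with the gradient that would push a triggered target input toward the adversarial label.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # feature dimension (illustrative)

# Surrogate linear model with squared loss: L(w; x, y) = 0.5 * (w @ x - y)**2
w = rng.normal(size=d)

def grad_w(x, y):
    """Gradient of the squared loss with respect to the weights w."""
    return (w @ x - y) * x

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# Adversarial objective: the target input should be scored as y_adv.
x_target = rng.normal(size=d)
y_adv = 1.0
g_target = grad_w(x_target, y_adv)

# Clean, correctly labeled training points the attacker may perturb
# within an l-infinity budget eps (so no visible trigger is inserted).
xs = rng.normal(size=(4, d))
ys = np.zeros(4)
eps = 0.5
delta = np.zeros_like(xs)

def matching_loss(delta):
    """1 - cosine similarity between poison and target gradients."""
    g_poison = np.mean(
        [grad_w(x + dx, y) for x, dx, y in zip(xs, delta, ys)], axis=0
    )
    return 1.0 - cosine(g_poison, g_target)

loss_before = matching_loss(delta)
best_loss, best_delta = loss_before, delta.copy()
lr, h = 0.05, 1e-5
for _ in range(150):  # finite-difference projected gradient descent
    g = np.zeros_like(delta)
    for idx in np.ndindex(delta.shape):
        e = np.zeros_like(delta)
        e[idx] = h
        g[idx] = (matching_loss(delta + e) - matching_loss(delta - e)) / (2 * h)
    delta = np.clip(delta - lr * g, -eps, eps)  # project back into the budget
    cur = matching_loss(delta)
    if cur < best_loss:
        best_loss, best_delta = cur, delta.copy()

loss_after = best_loss
print(f"gradient-matching loss: {loss_before:.3f} -> {loss_after:.3f}")
```

Training on the perturbed points then moves the model's weights in roughly the same direction as training on the triggered target would, which is how the backdoor is planted without the trigger ever appearing in the training set. The paper additionally selects which training points to poison and periodically re-trains the surrogate model, steps omitted here for brevity.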


Supporting Our AI Overlords: Redesigning Data Systems to be Agent-First

Liu, Shu, Ponnapalli, Soujanya, Shankar, Shreya, Zeighami, Sepanta, Zhu, Alan, Agarwal, Shubham, Chen, Ruiqi, Suwito, Samion, Yuan, Shuo, Stoica, Ion, Zaharia, Matei, Cheung, Alvin, Crooks, Natacha, Gonzalez, Joseph E., Parameswaran, Aditya G.

arXiv.org Artificial Intelligence

Large Language Model (LLM) agents, acting on their users' behalf to manipulate and analyze data, are likely to become the dominant workload for data systems in the future. When working with data, agents employ a high-throughput process of exploration and solution formulation for the given task, one we call agentic speculation. The sheer volume and inefficiencies of agentic speculation can pose challenges for present-day data systems. We argue that data systems need to adapt to more natively support agentic workloads. We take advantage of the characteristics of agentic speculation that we identify, i.e., scale, heterogeneity, redundancy, and steerability - to outline a number of new research opportunities for a new agent-first data systems architecture, ranging from new query interfaces, to new query processing techniques, to new agentic memory stores.


Detecting Sleeper Agents in Large Language Models via Semantic Drift Analysis

Zanbaghi, Shahin, Rostampour, Ryan, Abid, Farhan, Jarmakani, Salim Al

arXiv.org Artificial Intelligence

Large Language Models (LLMs) can be backdoored to exhibit malicious behavior under specific deployment conditions while appearing safe during training, a phenomenon known as "sleeper agents." Recent work by Hubinger et al. demonstrated that these backdoors persist through safety training, yet no practical detection methods exist. We present a novel dual-method detection system combining semantic drift analysis with canary baseline comparison to identify backdoored LLMs in real-time. Our approach uses Sentence-BERT embeddings to measure semantic deviation from safe baselines, complemented by injected canary questions that monitor response consistency. Evaluated on the official Cadenza-Labs dolphin-llama3-8B sleeper agent model, our system achieves 92.5% accuracy with 100% precision (zero false positives) and 85% recall. The combined detection method operates in real-time (<1s per query), requires no model modification, and provides the first practical solution to LLM backdoor detection. Our work addresses a critical security gap in AI deployment and demonstrates that embedding-based detection can effectively identify deceptive model behavior without sacrificing deployment efficiency.
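The semantic-drift component described above can be sketched in a few lines. This is a simplified illustration, not the paper's system: the "embeddings" here are toy random vectors standing in for Sentence-BERT outputs, and the `DriftDetector` class name and threshold value are assumptions for the example, not the paper's settings.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

class DriftDetector:
    """Flag responses whose embedding drifts too far from a safe baseline.

    In the paper, embeddings come from Sentence-BERT; here they are toy
    vectors, and the threshold is an illustrative assumption.
    """

    def __init__(self, baseline_embeddings, threshold=0.6):
        # Centroid of embeddings of known-safe responses.
        self.centroid = np.mean(baseline_embeddings, axis=0)
        self.threshold = threshold

    def is_suspicious(self, response_embedding):
        # Low similarity to the safe centroid indicates semantic drift.
        return cosine(response_embedding, self.centroid) < self.threshold

rng = np.random.default_rng(1)
# Tightly clustered "safe" response embeddings around a common direction.
safe = rng.normal(loc=1.0, scale=0.1, size=(20, 16))
detector = DriftDetector(safe)

on_baseline = rng.normal(loc=1.0, scale=0.1, size=16)   # near the safe cluster
drifted = rng.normal(loc=-1.0, scale=0.1, size=16)      # far from it

print(detector.is_suspicious(on_baseline))  # False: consistent with baseline
print(detector.is_suspicious(drifted))      # True: flagged as drifted
```

The canary-question component would run alongside this, comparing responses to fixed injected questions against their recorded baselines with the same similarity measure.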


Mechanistic Exploration of Backdoored Large Language Model Attention Patterns

Baker, Mohammed Abu, Babu-Saheer, Lakshmi

arXiv.org Artificial Intelligence

Recent advances in artificial intelligence (AI), particularly in the domain of large language models (LLMs), have significantly amplified concerns around AI safety and security. One critical aspect of these concerns is the vulnerability of LLMs to backdoor attacks: a malicious strategy whereby an attacker injects specific triggers into training data, resulting in "sleeper agents" that behave normally until activated by particular inputs [6]. These backdoored models (also known as sleeper agents or trojaned models) pose a serious threat, as they cannot be detected by standard evaluation methods and manifest undesirable or harmful behaviors only upon exposure to particular triggers in the input [2]. Triggers can take many forms, ranging from simple single-token lexical triggers to complex semantic triggers [9]. The significance of studying backdoor vulnerabilities arises from two primary threat models.

Data-poisoned sleeper agents: these involve deliberate poisoning of the training data to trigger specific harmful behaviors under attacker-defined conditions [3]. The real-world implications are substantial; for instance, autonomous vehicles might misinterpret modified road signs, potentially leading to fatal accidents, or software coding assistants might generate insecure code when prompted by certain organisations, leaving those organisations' software systems vulnerable to attack if the generated code is not carefully inspected [3].

Deceptive instrumental alignment: plausibly, models could develop deceptive behaviors organically during training [8]. Such models exhibit compliant behavior during training and evaluation but deviate from their developer-defined goals once deployed. While naturally occurring deceptive models have not yet been reported, the training process does select for such behaviour [6].





Two-faced AI language models learn to hide deception

Nature

Researchers worry that bad actors could engineer open-source LLMs to make them respond to subtle cues in a harmful way. Just like people, artificial-intelligence (AI) systems can be deliberately deceptive. It is possible to design a text-producing large language model (LLM) that seems helpful and truthful during training and testing, but that behaves differently once deployed. And according to a study shared this month on arXiv [1], attempts to detect and remove such two-faced behaviour are often useless, and can even make the models better at hiding their true nature. The finding that trying to retrain deceptive LLMs can make the situation worse "was something that was particularly surprising to us … and potentially scary", says co-author Evan Hubinger, a computer scientist at Anthropic, an AI start-up company in San Francisco, California. Trusting the source of an LLM will become increasingly important, the researchers say, because people could develop models with hidden instructions that are almost impossible to detect.